Overview

This report records the updates to the COSMIC data sets, together with some initial analysis results.
Version: R version 3.1.3 (2015-03-09)
Markdown file: Report_COSMIC_Data.Rmd
Last update of Markdown file: 2015-10-04

Format of data summary and sample barcode

Sample barcode

Format: [Project]_[Participant]_[Gender]_[Smoking]_[Diagnosis]
Example: COSMIC_1101_M_N_H

Table 1. Sample Barcode
| Label | Identifier for | Value | Value description | Possible values |
|---|---|---|---|---|
| Project | Project name | COSMIC | COSMIC project | COSMIC |
| Participant | Study participant | 1101 | The sample with ID 1101. | Any alpha-numeric value |
| Gender | Gender | M | Male | M: male; F: female. |
| Smoking | The smoking history of participant | N | non-smoker | N: non-smoker; S: smoker; E: ex-smoker. |
| Diagnosis | Diagnosis of COPD | H | Healthy | H: healthy; C: COPD. |
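
As an illustration of the barcode layout, the following minimal sketch splits sample barcodes into the five fields of Table 1. It assumes the fields are separated by underscores, as in the example COSMIC_1101_M_N_H. The function name parse_sample_barcode and the second barcode are made up for the example and are not taken from the project scripts.

```r
# Hypothetical helper: split sample barcodes into the five fields of Table 1.
parse_sample_barcode <- function(barcode) {
  parts <- strsplit(barcode, "_", fixed = TRUE)
  ok <- vapply(parts, length, integer(1)) == 5
  if (!all(ok)) {
    stop("unexpected barcode: ", paste(barcode[!ok], collapse = ", "))
  }
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(out) <- c("Project", "Participant", "Gender", "Smoking", "Diagnosis")
  out
}

# The first barcode is the example from the text; the second is invented.
parsed <- parse_sample_barcode(c("COSMIC_1101_M_N_H", "COSMIC_2201_F_S_C"))
print(parsed)
```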

Data platform Barcode

Format: [Cell]_[Omic]_[Platform]
Example: BAL_P_DIGE

Table 2. Data Platform Barcode
| Label | Identifier for | Value | Value description | Possible values |
|---|---|---|---|---|
| Cell | Cell types | BAL | Data from BAL cells. | BAL; BALF; Exosomes; BEC; PLASMA |
| Omic | Omic study | P | Proteome | P: Proteome; T: Transcriptome; L: Lipidome; M: Metabolome |
| Platform | Technology used to generate omic data | DIGE | Protein expression by DIGE platform | HLA-typing; DIGE; iTRAQ; miR; mRNA; oxylipins; cys-LT; Global metab |

Information of raw dataset

Table 3. Update of COSMIC data files (2016-06-14)
| item | fname | sheet | nrows | scol | ecol | log | normalization | note | X |
|---|---|---|---|---|---|---|---|---|---|
| datasummary | ./rawdata/Data block Incl Excl_summary_2014-11-29.xlsx | data | 119 | 12 | NA | NA | NA | Deleted sample COSMIC_ID 3208 3218 | |
| clinic | ./rawdata/COSMIC clinical_2016-03-31_selected.xlsx | data | 140 | NA | 124 | NA | NA | NA | |
| BAL_P_DIGE | ./rawdata/BAL_proteomics_DIGE_log10_ratio_2012.xlsx | data | 406 | 2 | 78 | log10 | ratio | NA | |
| BAL_P_iTRAQ_ONE | ./rawdata/BALcell_proteomics_iTRAQ_2016-05.xlsx | dataONE | 1809 | 2 | 70 | log2 | ratio | need missing value imputation | Proteins detected in at least 75% of the subjects in at least one group included (6 groups; Healthy, smoker and COPD stratified by gender) |
| BAL_P_iTRAQ_ALL | ./rawdata/BALcell_proteomics_iTRAQ_2016-05.xlsx | dataALL | 940 | 2 | 70 | log2 | ratio | need missing value imputation | Proteins detected in at least 75% of the subjects in all main group included (6 groups; Healthy, smoker and COPD stratified by gender) |
| BAL_T_miR | ./rawdata/BAL_miR_excl QC outl_quantile_log2_2011-10-23.xlsx | data | 896 | 43 | 128 | log2 | Quantile | NA | |
| BAL_T_mRNA | ./rawdata/BAL_T_mRNA_raw.csv | data | 41002 | 6 | 56 | log2 | Quantile | From ./rawdata/BAL_mRNA_All excl clinical_quantile_log2_2012-01-24.xlsx | |
| BEC_T_miR | ./rawdata/BEC_miR_Hsa_quantile_log2_66 samples_63 COSMIC_2014-12-11.xlsx | data | 896 | 43 | 108 | log2 | Quantile | NA | |
| EXO_T_miR | ./rawdata/EXO_miR_quantileAll_log2_64 subjects.xlsx | data | 1214 | 43 | 106 | log2 | Quantile | NA | |
| miRNA_annot_v3.6 | ./rawdata/miRNA-all-v3.6-8x15k-annotation.xlsx | NA | NA | NA | NA | NA | NA | From ./rawdata/miRAnnotation/miRNA-all-v3.6-8x15k-annotation.txt | |
| BALF_M_Oxylip | ./rawdata/Serum_BALF_oxylipins_final incl LODs_Balgoma.xlsx | Oxylip_BALF | 46 | 5 | 124 | NULL | NULL | | |
| Serum_M_Oxylip | ./rawdata/Serum_BALF_oxylipins_final incl LODs_Balgoma.xlsx | Oxylip_serum | 75 | 5 | 124 | NULL | NULL | | |
| Oxylip_annot | ./rawdata/Serum_BALF_oxylipins_final incl LODs_Balgoma.xlsx | KEGG IDS | 41 | 1 | 8 | NULL | NULL | | |
| Serum_M_Non_targeted | ./rawdata/Serum_metabolomics_3 platform_Shama_2016-05.xlsx | data | 1104 | 9 | 144 | NULL | NULL | | |
| Non_targeted_annot | ./rawdata/Serum_metabolomics_3 platform_Shama_2016-05.xlsx | data | 1104 | 1 | 8 | | | | |
| Serum_M_Biocrates | ./rawdata/Serum_metabolomics_3 platform_Shama_2016-05.xlsx | Biocrates result | 79 | 3 | 186 | NULL | NULL | samples in rows, need to transpose | |
| Serum_M_Kynurenine | ./rawdata/Serum_metabolomics_3 platform_Shama_2016-05.xlsx | Kynurenine pathway | 118 | 3 | 6 | NULL | NULL | samples in rows, need to transpose | |
| Serum_M_Sphingolipid | ./rawdata/Serum_metabolomics_3 platform_Shama_2016-05.xlsx | Sphingolipid analysis | 116 | 3 | 31 | NULL | NULL | samples in rows, need to transpose | |
| BEC_P_TMT | ./rawdata/Proteomics_TMT_BECs_20160604_FINAL.xlsx | data | 1137 | 2 | 91 | log2 | ratio | | |
| clinic_bioconductor | ./rawdata/COSMIC clinical_2016-03-31_selected.csv | Null | NA | NA | NA | | | | |

Data Summary

All data are transformed to log2 values.

Table 4. Summary of COSMIC raw data files (2016-06-14)
| item | nrow/features | ncol/sample | unique_features | missing_values | data type |
|---|---|---|---|---|---|
| HLA_typing | 32 | 118 | 32 | 0 | integer |
| BAL_T_mRNA | 41000 | 51 | 41000 | 0 | numeric |
| BAL_P_DIGE | 404 | 77 | 108 | 0 | numeric |
| BAL_T_miR | 880 | 86 | 880 | 0 | numeric |
| BEC_T_miR | 880 | 63 | 880 | 0 | numeric |
| EXO_T_miR | 1212 | 64 | 1212 | 0 | numeric |
| BALF_M_Oxylip | 45 | 114 | 45 | 0 | numeric |
| Serum_M_Oxylip | 74 | 115 | 74 | 0 | numeric |
| Serum_M_Non_targeted | 1103 | 116 | 1103 | 0 | numeric |
| Serum_M_Biocrates | 182 | 76 | 182 | 0 | numeric |
| Serum_M_Kynurenine | 4 | 115 | 4 | 0 | numeric |
| Serum_M_Sphingolipid | 29 | 115 | 29 | 0 | numeric |
| BAL_P_iTRAQ_ONE_impute | 1266 | 69 | 1266 | 0 | numeric |
| BAL_P_iTRAQ_ALL_impute | 939 | 69 | 939 | 0 | numeric |
| BEC_P_TMT_impute | 1136 | 90 | 1136 | 0 | numeric |
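
The columns of Table 4 could be computed for one data block roughly as in the sketch below. It assumes each block is a matrix with features in rows and samples in columns, and that unique_features counts distinct row names. The function name summarise_block and the toy matrix are illustrative only; how the report actually derives these numbers is not shown here.

```r
# Hypothetical helper: one summary row per data block, mirroring Table 4.
summarise_block <- function(x) {
  data.frame(
    nrow_features   = nrow(x),
    ncol_samples    = ncol(x),
    unique_features = length(unique(rownames(x))),  # assumption: distinct row names
    missing_values  = sum(is.na(x)),
    data_type       = class(x[1, 1]),
    stringsAsFactors = FALSE
  )
}

# Toy block: 3 feature rows (one row name duplicated), 2 samples, 1 missing value.
toy <- matrix(c(1.2, NA, 0.4, 2.1, 3.3, 0.9), nrow = 3,
              dimnames = list(c("f1", "f2", "f2"), c("s1", "s2")))
block_summary <- summarise_block(toy)
print(block_summary)
```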
Figure 1. The histogram for samples by platforms.
Red, green, yellow and blue class tags represent samples in groups of Non-smoking Healthy, Smoking Healthy, Smoking COPD and Ex-smoking COPD respectively.
Figure 2. Summary of Multi-omics Data from COSMIC Project.
Each row represents a platform, and each column is a sample. Dark and light blue cells show whether the data are available or not. Red, green, yellow and blue class tags represent samples in groups of Non-smoking Healthy, Smoking Healthy, Smoking COPD and Ex-smoking COPD respectively. Black and grey in the Gender bar show female and male respectively. MF: alveolar macrophages; BEC: bronchial epithelial cell; BAL: bronchoalveolar lavage; EXO: exosomes from bronchoalveolar lavage fluid (BALF). The number in brackets is the number of samples tested on each platform. For the platform barcodes, see the “Data platform Barcode” section.

 

Figure 3. Correlation map between different platforms
The color intensity is positively correlated with the percentage of samples tested by both platforms. The values are the numbers of samples tested by the corresponding platform or by both platforms.
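
The counts behind such a map can be sketched from a platform-by-sample availability matrix. The matrix below is a toy example with invented availability. Using the total number of samples as the denominator for the percentage is an assumption, since the report does not state how the percentage is defined.

```r
# Toy availability matrix: 1 = sample measured on the platform, 0 = not measured.
avail <- matrix(c(1, 1, 0, 1,
                  1, 0, 0, 1,
                  0, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("BAL_P_DIGE", "BAL_T_miR", "BEC_T_miR"),
                                paste0("s", 1:4)))

# Off-diagonal: samples measured on both platforms; diagonal: samples per platform.
shared <- avail %*% t(avail)

# Assumption: percentage relative to all samples in the matrix.
shared_pct <- shared / ncol(avail) * 100
print(shared)
```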

Boxplot of all platforms

Data Summary

Before importing to R

  1. Copy sheet Data block Incl Excl_QC excl to sheet data.
  2. Delete contents; the final layout is as in sheet data.

After importing to R

  1. Use datasummary_format.R to format the barcode and to label the 4 groups in the cgroup column as NH, SH, SC and EC.
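
The four group labels match the smoking and diagnosis codes of the sample barcode (NH: non-smoking healthy, SH: smoking healthy, SC: smoking COPD, EC: ex-smoking COPD). The sketch below derives them from barcodes under that assumption. It is not the content of datasummary_format.R, which is not shown in this report, and all barcodes except the first are invented.

```r
# Only the first barcode appears in the report; the rest are made up.
barcodes <- c("COSMIC_1101_M_N_H", "COSMIC_1102_F_S_H",
              "COSMIC_1103_M_S_C", "COSMIC_1104_F_E_C")

# Split on underscores; column 4 is Smoking, column 5 is Diagnosis.
fields <- do.call(rbind, strsplit(barcodes, "_", fixed = TRUE))

# Concatenate the smoking and diagnosis codes into the group label.
cgroup <- factor(paste0(fields[, 4], fields[, 5]),
                 levels = c("NH", "SH", "SC", "EC"))
print(table(cgroup))
```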

Clinical Data 2016-03-31

Before importing to R

  1. Add sheet data and copy content from COSMIC clinical 2016-03-31 to it;
  2. Replace empty cells with NA (n=388); replace na with NA (n=541);
  3. Duplicate the first row, Subject ID, as COSMIC ID at row 2;
  4. Copy the first column to the front, replace spaces with _, rename Smoking_status to Smoking and Diagnosis_(Healthy=1,__COPD=2) to Diagnosis.
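
These steps are carried out in the spreadsheet, but the NA handling and the column renaming could alternatively be done at import time. The sketch below shows one way, using a small inline CSV with an invented FEV1 column; it is an illustration, not the procedure used for this report.

```r
# Toy CSV text; FEV1 is a hypothetical column used only for illustration.
csv_text <- paste(c("Subject ID,Smoking status,FEV1",
                    "1101,N,na",
                    "1102,S,",
                    "1103,E,3.2"),
                  collapse = "\n")

# Treat empty cells and "na" as missing while reading.
clinic <- read.csv(text = csv_text, na.strings = c("", "na", "NA"),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Replace spaces in column names with underscores, then rename.
names(clinic) <- gsub(" ", "_", names(clinic), fixed = TRUE)
names(clinic)[names(clinic) == "Smoking_status"] <- "Smoking"
print(clinic)
```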

Clinical data for Bioconductor prepared by Vincenzo (cosmic$clinic_bioconductor)

The code is in preprocessing.R. After COSMIC clinical_2016-03-31_selected.csv is read in, a column variable named barcode is added.

After importing to R

  1. Set the column names to datasummary$barcode
  2. Use the first column as row names, then delete the first column

Clinical Data

Before importing to R

  1. Add sheet data and copy the content from clinical_2015… to it;
  2. Replace empty cells with NA (n=248); replace na with NA (n=684); delete the columns after DT;
  3. Replace patient no with COSMIC ID at A184;
  4. Copy the first column to the front, replace spaces with _, rename Smoking to Smoking and Doctor_ to Doctor.

After importing to R

  1. Rename duplicated row names to “rawname” + 2
  2. Set the column names to datasummary$barcode
  3. Use the first column as row names, then delete the first column

BAL_T_mRNA

Before importing to R

  1. Copy sheet “mRNA_all subj o genes_quant.xls” to “data”.
  2. Delete rows 2 to 16; delete columns B and G to K; delete columns without expression values (empty cells).
  3. Insert a copy of the first row as the second row. Rename the first item as “COSMIC_ID” and format the COSMIC IDs.
  4. No missing values from column E to BD.
  5. Number of rows = 41002, start column = 6.
  6. Save sheet “data” to the .csv file ./rawdata/BAL_T_mRNA_raw.csv.

BAL_P_DIGE

Before importing to R

  1. Copy pure data to new sheet “data”, column E, Q to CO
  2. Insert a copy of first row as the second row. The first item was renamed as “COSMIC_ID”

After importing to R

  1. Convert the values from the original log10 transformation to log2
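
The conversion is a change of logarithm base: log2(x) = log10(x) / log10(2). A minimal sketch with made-up ratio values:

```r
# Toy log10 ratios (not taken from the DIGE data).
log10_ratio <- c(-0.5, 0, 0.25, 1)

# Change of base: dividing by log10(2) turns log10 values into log2 values.
log2_ratio <- log10_ratio / log10(2)

# Cross-check against back-transforming and taking log2 directly.
stopifnot(isTRUE(all.equal(log2_ratio, log2(10^log10_ratio))))
print(log2_ratio)
```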

BAL_P_iTRAQ

Before importing to R

  1. Copy sheet iTRAQ_BALc_75p ALL_ratio_log2 to dataALL.
  2. Delete rows 2 to 4; delete columns B to S; in row 1, from column B to the end, replace “_log2*” with nothing and format the cells as numbers.
  3. Replace all empty cells with “NA” (n=1599).
  4. Copy sheet iTRAQ_BALc_75per_ONE_ratio_log2 to dataONE.
  5. Delete rows 2 to 4; delete columns B to S; in row 1, from column B to the end, format the cells as numbers.
  6. Replace all empty cells with “NA” (n=20652).

After importing to R

  1. BAL_P_iTRAQ_ALL and BAL_P_iTRAQ_ONE correspond to the sheets dataALL (step 1 above) and dataONE (step 4 above) respectively.
  2. Test different KNN imputation parameters based on BAL_P_iTRAQ_ONE.
  3. The KNN imputation tests did not perform well; the data are nevertheless imputed with K = 10. The report is in “./doc/Report_Imputation_BAL_iTRAQ_CXL160525.docx”, produced using iTRAQ_impute.R.
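
To make the idea of KNN imputation concrete, the sketch below is a deliberately simple base-R version: for each protein (row) with missing values, it finds the K most similar rows and averages their values in the missing columns. It is an illustration only and is not the code in iTRAQ_impute.R, which is not reproduced in this report; the function name knn_impute and the toy data are made up. In practice a maintained implementation such as impute.knn from the Bioconductor impute package would likely be preferable.

```r
# Illustrative row-wise KNN imputation (features in rows, samples in columns).
knn_impute <- function(x, k = 10) {
  stopifnot(is.matrix(x), is.numeric(x))
  out <- x
  for (i in which(rowSums(is.na(x)) > 0)) {
    miss <- is.na(x[i, ])
    # Root-mean-square distance to every other row over jointly observed columns.
    d <- vapply(seq_len(nrow(x)), function(j) {
      if (j == i) return(Inf)
      shared <- !miss & !is.na(x[j, ])
      if (!any(shared)) return(Inf)
      sqrt(mean((x[i, shared] - x[j, shared])^2))
    }, numeric(1))
    nb <- order(d)[seq_len(min(k, sum(is.finite(d))))]
    # Average the neighbours' observed values in each missing column.
    fill <- colMeans(x[nb, miss, drop = FALSE], na.rm = TRUE)
    # Fall back to the row mean if no neighbour has a value for a column.
    fill[is.nan(fill)] <- mean(x[i, ], na.rm = TRUE)
    out[i, miss] <- fill
  }
  out
}

# Toy data: 12 features x 5 samples with two missing cells.
set.seed(1)
m <- matrix(rnorm(60), nrow = 12)
m[2, 3] <- NA
m[7, 1] <- NA
imputed <- knn_impute(m, k = 10)
```

Observed values are left untouched; only the NA cells are filled.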

BAL_T_miR

Before importing to R

  1. Copy sheet “HsamiR_quantle_BAL_excl QCoutl” to “data”.
  2. Insert a copy of the first row as the second row.
  3. Fill in “input_info.xlsx”: par[“BAL_T_miR”,“scol”]<-43, the starting column of the expression data.
  4. Change cell H2 (row 2, column aveA) to 1.00.

After importing to R

  1. Use column “Probe.Sequence” as row names.
  2. Delete columns 1 to par[“BAL_T_miR”,“scol”] - 1 (that is, 42), except column aveA, which is moved to the last column.
  3. Format the column names as formatted COSMIC IDs; delete the columns without a formatted COSMIC ID.
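
The post-import steps can be sketched on a toy table as follows. The sketch assumes that formatted COSMIC IDs look like the sample barcode COSMIC_1101_M_N_H and that aveA is kept as the last column after the sample filter. The toy table uses scol = 4 instead of 43, and all values and the columns Annot and QC_pool are invented.

```r
scol <- 4  # starting column of expression data in this toy table (43 in the report)

mir <- data.frame(Probe.Sequence    = c("AAC", "GGT"),
                  aveA              = c(1.0, 7.2),
                  Annot             = c("x", "y"),
                  COSMIC_1101_M_N_H = c(5.1, 6.3),
                  QC_pool           = c(5.0, 6.0),
                  COSMIC_1102_F_S_H = c(4.8, 6.9),
                  stringsAsFactors  = FALSE)

# Step 1: probe sequences become the row names.
rownames(mir) <- mir$Probe.Sequence

# Step 2: drop the columns before scol.
expr <- mir[, seq(scol, ncol(mir)), drop = FALSE]

# Step 3: keep only columns named with a formatted COSMIC ID (assumed pattern).
is_sample <- grepl("^COSMIC_[0-9]+_[MF]_[NSE]_[HC]$", names(expr))
expr <- expr[, is_sample, drop = FALSE]

# aveA is retained and placed last.
expr$aveA <- mir$aveA
print(expr)
```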

BEC_T_miR

Before importing to R

  1. Copy sheet “Hsa_miR_Quantile” to “data”.
  2. Delete all columns that are not sample expression values, except Prob.sequence and aveA

After importing to R

  1. Use column “Probe.Sequence” as row names.
  2. Delete columns 1 to par[“BEC_T_miR”,“scol”] - 1 (that is, 42), except column aveA, which is moved to the last column.
  3. Format the column names as formatted COSMIC IDs; delete the columns without a formatted COSMIC ID.

EXO_T_miR

Before importing to R

  1. Copy sheet “EXO_miR_quantileAll_log2_66 sub” to “data”.
  2. Insert a copy of the first row as the second row.
  3. Fill in “input_info.xlsx”: par[“EXO_T_miR”,“scol”]<-43, the starting column of the expression data.
  4. Change cell H2 (row 2, column aveA) to 1.00.

After importing to R

  1. Use column “Probe.Sequence” as row names.
  2. Delete columns 1 to par[“EXO_T_miR”,“scol”] - 1 (that is, 42), except column aveA, which is moved to the last column.
  3. Format the column names as formatted COSMIC IDs; delete the columns without a formatted COSMIC ID.

BALF_M_Oxylip

Before importing to R

  1. Replace empty cells with “NA” (n=226).

After importing to R

  1. Use “SecID” as row names.
  2. Delete columns/samples with all NA values.
  3. Impute missing values as 1/3 of the LLOQ of the respective feature.
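
A minimal sketch of dropping empty samples and imputing with LLOQ/3 per feature. It assumes features in rows, samples in columns, and a named vector of LLOQ values; the feature names, values and LLOQs below are invented.

```r
# Toy matrix: 2 features x 4 samples; sample s4 has no values at all.
oxy <- matrix(c(0.9, NA,  1.4, NA,
                NA,  0.5, 0.2, NA),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("ox1", "ox2"), paste0("s", 1:4)))
lloq <- c(ox1 = 0.3, ox2 = 0.06)  # hypothetical per-feature LLOQ

# Drop samples (columns) that contain only NA.
oxy <- oxy[, colSums(!is.na(oxy)) > 0, drop = FALSE]

# Replace each remaining NA with one third of the LLOQ of its own feature (row).
fill <- lloq[rownames(oxy)] / 3
miss <- is.na(oxy)
oxy_imp <- oxy
oxy_imp[miss] <- fill[row(oxy)[miss]]
print(oxy_imp)
```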

Serum_M_Oxylip

Before importing to R

  1. Replace empty cells with “NA” (n=222).

After importing to R

  1. Use “SecID” as row names.
  2. Delete columns/samples with all NA values.
  3. Impute missing values as 1/3 of the LLOQ of the respective feature.

Oxylip_annot

Before importing to R

  1. Fill A1=SecID, B1=Symbol.
  2. Replace empty cells and “-” with “NA”.

After importing to R

  1. Use the first row as the header/column names.
  2. Use “SecID” as row names.

Serum_M_Non_targeted

Before importing to R

  1. Copy sheet “Non-targeted(Q-Ex)” to data, transposed.
  2. Correct COS_1101_2206b to COS_1101; COS_2201_2210b to COS_2201.
  3. Delete “COS_” in row 1.

After importing to R

  1. Convert the raw (non-log-transformed) values to log2

Non_targeted_annot

Before importing to R

  1. The same as Serum_M_Non_targeted, from Serum_metabolomics_3 platform_Shama_2016-05.xlsx sheet “data”/“Non-targeted(Q-Ex)”

After importing to R

  1. The first 8 columns are extracted
  2. Merge conversionTable by Vincenzo into it, by adding a new column “KEGG”
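
Adding the KEGG column amounts to a lookup from the annotation table into conversionTable. The sketch below assumes both tables share a metabolite name column called Name; that column name and the toy rows are assumptions, since the layout of conversionTable is not described in this report.

```r
# Toy annotation table (stands in for the first 8 extracted columns).
annot <- data.frame(Name = c("citrate", "unknown_1", "lactate"),
                    stringsAsFactors = FALSE)

# Toy conversion table with an assumed layout: Name and KEGG.
conversionTable <- data.frame(Name = c("lactate", "citrate"),
                              KEGG = c("C00186", "C00158"),
                              stringsAsFactors = FALSE)

# match() keeps the row order of annot; unmatched metabolites get NA.
annot$KEGG <- conversionTable$KEGG[match(annot$Name, conversionTable$Name)]
print(annot)
```

Using match() rather than merge() keeps the original row order and the original number of rows.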

Serum_M_Biocrates, Serum_M_Kynurenine, Serum_M_Sphingolipid

Before importing to R

  1. Replace “<LLOQ” and empty cells with “NA”

After importing to R

  1. Transpose the data so that rows are features and columns are samples
  2. Remove features that contain only NA or 0
  3. Impute missing values as 1/3 of the LLOQ of the respective feature for Serum_M_Kynurenine
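
The transpose and filter steps can be sketched as below, using an invented sample-by-metabolite matrix. A feature is dropped when none of its values is both non-missing and non-zero.

```r
# Toy input with samples in rows, as in the raw sheets.
raw <- matrix(c(1.2, 0, NA, NA,
                0.8, 0, NA, 0,
                NA,  0, NA, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("s1", "s2", "s3"), c("m1", "m2", "m3", "m4")))

# Transpose: features in rows, samples in columns.
feat <- t(raw)

# Keep a feature only if it has at least one value that is neither NA nor 0.
keep <- rowSums(!is.na(feat) & feat != 0) > 0
feat <- feat[keep, , drop = FALSE]
print(feat)
```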

BEC_P_TMT

Before importing to R

  1. Copy sheet “TMT_BEC_75%_All_log2_ratio” to data
  2. In sheet data, delete columns B to S; delete rows 1, 3 and 4
  3. Replace empty cells with “NA”, n = 2234

After importing to R

  1. Convert the first row to COSMIC barcodes and use them as column names; use the first column as row names